## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Initial observations are:
There are 1599 observations of 13 variables. X appears to be the unique identifier. Quality is an ordered, categorical, discrete variable. Its values range only from 3 to 8, with a mean of 5.6 and a median of 6. From the variable descriptions, it appears that fixed.acidity and volatile.acidity, and likewise free.sulfur.dioxide and total.sulfur.dioxide, may be subsets of each other.
Since we’re primarily interested in categorizing/modelling wines based on quality, it would make sense to convert X and quality into factor variables.
## Factor w/ 1599 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
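The conversion described above can be sketched as follows. This is a minimal illustration, not the original chunk: the data frame name `wine` is an assumption, and only the first ten `X` and `quality` values from the `str()` output are mocked up.

```r
# Toy stand-in for the wine data (the name `wine` is hypothetical);
# the values are the first ten rows shown in the str() output.
wine <- data.frame(X = 1:10,
                   quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# X is only an identifier, so an unordered factor is enough.
wine$X <- factor(wine$X)

# Quality has a natural order. Declaring all six levels (3 to 8) keeps
# levels that happen to be absent from this small sample.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine$X)
str(wine$quality)
```

Declaring the levels explicitly means a subset of the data that lacks, say, quality 3 still carries the same six ordered levels.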
To check the distribution of the data, I plotted a histogram and a box plot side by side for each variable, showing count against the variable. The box plots are included because they give a very clear representation of each distribution and its outliers.
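A side-by-side histogram and box plot can be sketched with base R graphics. The plotting code of this report is not shown, so this is only an assumed equivalent, and it uses simulated right-skewed values rather than the wine data.

```r
# Simulated right-skewed values standing in for one wine variable.
set.seed(1)
x <- rlnorm(500, meanlog = 2, sdlog = 0.3)

# Two panels in one row: histogram on the left, box plot on the right.
old_par <- par(mfrow = c(1, 2))
hist(x, breaks = 30, main = "Histogram", xlab = "value", ylab = "count")
boxplot(x, main = "Box plot", ylab = "value")
par(old_par)

# boxplot.stats() returns the points drawn as outliers beyond the whiskers.
n_outliers <- length(boxplot.stats(x)$out)
```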
Since the histogram and box plot of fixed.acidity vs count are positively skewed, I re-plotted the distribution after transforming fixed.acidity to log10(fixed.acidity). The histogram of log10(fixed.acidity) vs count is closer to a normal distribution, which is better suited for modelling.
Since the histogram and box plot of volatile.acidity vs count are positively skewed, I re-plotted the distribution after transforming volatile.acidity to log10(volatile.acidity). The histogram of log10(volatile.acidity) vs count is closer to a normal distribution, which is better suited for modelling.
Since the value of citric acid is 0 for many data points, I removed those data points before re-plotting the histogram of the distribution.
Citric acid seems to be an added ingredient, as the counts don’t seem to follow any order.
Since the histogram and box plot of residual.sugar vs count are positively skewed, I re-plotted the distribution after transforming residual.sugar to log10(residual.sugar). The histogram of log10(residual.sugar) vs count is closer to a normal distribution, which is better suited for modelling.
Since the histogram and box plot of chlorides vs count are positively skewed, I re-plotted the distribution after transforming chlorides to log10(chlorides). The histogram of log10(chlorides) vs count is closer to a normal distribution, which is better suited for modelling.
Since the histogram and box plot of free.sulfur.dioxide vs count are positively skewed, I re-plotted the distribution after transforming free.sulfur.dioxide to log10(free.sulfur.dioxide). The histogram of log10(free.sulfur.dioxide) vs count is closer to a normal distribution, which is better suited for modelling.
Since the histogram and box plot of total.sulfur.dioxide vs count are positively skewed, I re-plotted the distribution after transforming total.sulfur.dioxide to log10(total.sulfur.dioxide). The histogram of log10(total.sulfur.dioxide) vs count is closer to a normal distribution, which is better suited for modelling.
It appears that density is normally distributed, with few outliers.
It appears that pH is normally distributed, with few outliers.
Since the histogram and box plot of sulphates vs count are positively skewed, I re-plotted the distribution after transforming sulphates to log10(sulphates). The histogram of log10(sulphates) vs count is closer to a normal distribution, which is better suited for modelling.
Since the histogram and box plot of alcohol vs count are positively skewed, I re-plotted the distribution after transforming alcohol to log10(alcohol). The histogram of log10(alcohol) vs count is closer to a normal distribution, which is better suited for modelling.
It appears that density and pH are normally distributed, with few outliers.
Fixed and volatile acidity, sulfur dioxides, and alcohol seem to be long-tailed. The volatile acidity distribution appears bimodal at 0.4 and 0.6 with some outliers in the higher ranges.
Qualitatively, sulphates, residual sugar and chlorides have extreme outliers.
Citric acid appeared to have a large number of zero values. I was curious whether these are actually zero or a case of non-reporting. After reading about wine making, it became clear why the citric acid quantity is zero in many wines: citric acid is an added ingredient used to enhance the acidity of wines, and not all wine makers add it.
Categorizing wine quality as poor, average or best, based on ranges of the quality score, makes the analysis more user friendly. A histogram then shows the number of wine observations per category.
## poor average best
## 63 1319 217
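One way to build such a rating is with `cut()`. The sketch below is an illustration under assumptions: the report does not state the break points, so grouping scores 3-4 as poor, 5-6 as average and 7-8 as best is a guess, and the variable name `rating` is hypothetical. The input is the first ten quality values from the `str()` output.

```r
# First ten quality scores from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed bins: (2,4] -> poor, (4,6] -> average, (6,8] -> best.
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("poor", "average", "best"),
              ordered_result = TRUE)

table(rating)
```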
Adding a new variable called total.acidity, since fixed.acidity, volatile.acidity and citric.acid are all components of the actual acidity.
Density and pH are normally distributed. All the remaining variables display positive skew.
If the distribution of a variable has a positive skew (a long-tailed histogram), taking the logarithm of the variable sometimes helps when fitting it into a model. Log transformations make positively skewed distributions more normal, as observed in the histograms above. Also, if most of the wine observations fall within a certain range, it’s better to consider that range for modelling.
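The effect of a log transformation on skew can be checked numerically. The sketch below uses simulated log-normal data rather than the wine data, and the `skewness()` helper is a hypothetical function defined here (third central moment divided by the cubed standard deviation), not one taken from a package.

```r
# Sample skewness: third central moment over the cubed standard deviation.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)  # simulated, positively skewed

skew_raw <- skewness(x)         # clearly positive for log-normal data
skew_log <- skewness(log10(x))  # close to zero after the transformation
```

A skewness near zero after the transformation is what "more normal" means in the histograms above. Real variables are rarely exactly log-normal, so the improvement is usually partial.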
The main feature in the data is quality. I’d like to explore which features determine the quality of wines.
The variables related to acidity (fixed.acidity, volatile.acidity, citric.acid and pH) might explain quality through their effect on taste. I suspect that different acid concentrations alter the taste of the wine. Also, residual.sugar affects how sweet a wine is and might also influence taste.
I created an ordered factor from quality, classifying each wine sample as ‘poor’, ‘average’, or ‘best’.
Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids: tartaric acid and acetic acid. I decided to create a combined variable, total.acidity, containing the sum of tartaric, acetic, and citric acid.
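The combined variable is a row-wise sum of the three acid columns. A minimal sketch, using the first four rows from the `str()` output and assuming the data frame is called `wine`:

```r
# First four rows of the three acid columns, taken from the str() output.
wine <- data.frame(fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
                   volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
                   citric.acid      = c(0.00, 0.00, 0.04, 0.56))

# Row-wise sum of the tartaric, acetic and citric acid measurements.
wine$total.acidity <- with(wine,
                           fixed.acidity + volatile.acidity + citric.acid)
```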
I addressed the distributions in the ‘Distributions’ section. Boxplots are better suited to visualizing outliers, so I plotted a boxplot of each variable.
In the univariate analysis, I chose not to tidy or adjust any data, except that I plotted a select few variables (the ones that showed positive skew) on logarithmic scales and excluded the observations with citric.acid = 0 from the citric.acid plot.
Citric.acid stood out from the other distributions. Apart from some outliers, it had a roughly rectangular distribution, which seems very odd given the wine quality distribution.
Let’s check the correlations among all variables to see how each variable relates to quality.
From the correlation plot, quality is most correlated with alcohol and volatile.acidity, followed by sulphates and citric acid.
The other highly correlated pairs of variables are fixed.acidity and volatile.acidity, fixed.acidity and free.sulfur.dioxide, and fixed.acidity and density. Alcohol is also correlated with density.
This suggests that quality is influenced by alcohol and volatile.acidity. Since alcohol is correlated with density and volatile.acidity is correlated with fixed.acidity, density and fixed.acidity may also affect quality indirectly.
The boxplot above shows that alcohol content increases as the quality of wine increases.
The boxplot above shows that volatile.acidity decreases as the quality of wine increases.
The boxplot above shows that sulphate content increases as the quality of wine increases.
The boxplot above shows that citric acid increases along with the quality of wine.
The bivariate boxplots, with quality on the x-axis, are interesting in showing trends with wine quality. From exploring these plots, it seems that a ‘best’ wine generally has these trends: lower volatile acidity (acetic acid) and higher alcohol, sulphates and citric acid.
Interestingly, it appears that different types of acid affect wine quality differently; total.acidity showed that the presence of volatile (acetic) acid reduced quality.
The strongest relationship is between alcohol content and quality of wine. The second strongest is between volatile.acidity and quality: the more alcohol and the less volatile.acidity, the better the wine seems to be. There might be interactions among other variables in predicting the quality of wine, which can be analyzed by multivariate analysis.
Checking correlations of each variable with quality of wine applying cor.test:
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## total.acidity log10.residual.sugar log10.chlordies
## 0.10375373 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
The correlation coefficients show that the correlation between quality of wine and alcohol content is the highest, with volatile.acidity in second place (the same as observed before with the ggcorr plot), sulphates in third place and citric acid in fourth place. So, it will be interesting to see whether multivariate scatterplots of these variables show any combined effects on quality.
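A vector of correlations like the one above can be collected by applying `cor.test()` to each predictor in turn. The sketch below is an assumed reconstruction on simulated data, not the original chunk: the coefficients used to simulate `quality` are arbitrary, and with the real data the ordered factor would first need converting with `as.numeric()`.

```r
# Simulated stand-in for the wine data; the effect sizes are arbitrary.
set.seed(7)
n <- 500
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.7))
wine <- data.frame(alcohol, volatile.acidity, quality)

# Pearson correlation of each predictor with quality.
# cor.test() names its estimate "cor"; unname() drops that so the
# result is labelled by predictor only.
predictors <- c("alcohol", "volatile.acidity")
correlations <- sapply(predictors, function(v) {
  unname(cor.test(wine[[v]], wine$quality)$estimate)
})

correlations
```

`cor.test()` also returns a p-value and confidence interval for each pair, which `cor()` alone does not.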
Better quality wines have higher sulphate content and higher alcohol content.
The poor quality and best quality wines show a similar trend, whereas the remaining qualities show the opposite trend. This doesn’t make sense, so volatile.acidity and alcohol together don’t provide a reliable trend.
The poor quality and best quality wines show a somewhat similar trend, whereas the remaining qualities show the opposite trend. This doesn’t make sense, so citric acid and alcohol together don’t provide a reliable trend.
For every quality of wine, volatile acidity decreases as citric acid increases. However, the best quality wines have higher volatile acidity than the second best quality wines, which doesn’t make sense. This suggests that the data is not reliable here: some other factor in the best quality wines, one that we don’t know of, is showing up in this scatterplot.
Sulphates and citric acid together don’t provide any valuable insight here.
Acidity and pH are related to each other: the higher the acidity, the lower the pH value of a liquid. So, let’s fit a regression model that predicts pH from acidity, and then boxplot the error between the observed pH and the expected pH. If the boxplot shows more error for a certain quality of wine, it suggests that the measured variables might not be the only ones affecting quality.
There is more pH.error in the poor quality wines, which suggests that other factors, such as contamination, are causing the quality of those wines to be poor.
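The pH regression and its error can be sketched as below. Several things here are assumptions, since the original chunk is not shown: the predictor is taken to be log10(total.acidity), the error is taken to be observed minus predicted pH, and the data are simulated with arbitrary coefficients and random quality labels.

```r
# Simulated stand-in: pH falls as log10(total.acidity) rises.
set.seed(3)
n <- 300
total.acidity <- runif(n, min = 5, max = 16)
pH <- 4.3 - 1.0 * log10(total.acidity) + rnorm(n, sd = 0.1)
rating <- factor(sample(c("poor", "average", "best"), n, replace = TRUE),
                 levels = c("poor", "average", "best"))
wine <- data.frame(total.acidity, pH, rating)

# Linear model predicting pH from acidity.
ph_model <- lm(pH ~ log10(total.acidity), data = wine)

# Error defined here as observed pH minus predicted pH (an assumption).
wine$pH.error <- wine$pH - predict(ph_model, newdata = wine)

# One box per quality group; a wider box means acidity explains pH less well.
boxplot(pH.error ~ rating, data = wine,
        ylab = "observed pH - predicted pH")
```

Because the quality labels in this simulation are random, the three boxes should look alike. A visibly wider box for one group in the real data is what the paragraph above interprets as a sign of unmeasured factors.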
Generating a predictive linear model:
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates),
## data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) +
## log10(volatile.acidity), data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) +
## log10(volatile.acidity) + citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) +
## pH, data = training_data)
##
## ==================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------
## (Intercept) -0.259 0.337 0.257 0.317 1.851***
## (0.232) (0.234) (0.226) (0.226) (0.500)
## alcohol 0.374*** 0.352*** 0.312*** 0.311*** 0.365***
## (0.022) (0.022) (0.021) (0.021) (0.022)
## log10(sulphates) 1.884*** 1.346*** 1.472*** 1.712***
## (0.223) (0.223) (0.228) (0.227)
## log10(volatile.acidity) -1.304*** -1.523***
## (0.147) (0.172)
## citric.acid -0.330*
## (0.136)
## pH -0.511***
## (0.149)
## ----------------------------------------------------------------------------------
## R-squared 0.2 0.3 0.3 0.3 0.3
## adj. R-squared 0.2 0.3 0.3 0.3 0.3
## sigma 0.7 0.7 0.7 0.7 0.7
## F 285.4 189.0 162.3 123.8 131.3
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1037.2 -1002.6 -964.7 -961.8 -996.7
## Deviance 488.4 454.3 419.9 417.3 448.8
## AIC 2080.4 2013.1 1939.5 1935.6 2003.4
## BIC 2095.0 2032.6 1963.8 1964.8 2027.8
## N 959 959 959 959 959
## ==================================================================================
Notice that I did not include pH in the same formula as the acids, to avoid collinearity problems.
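The nested models above can be built incrementally with `update()`. The sketch below is an assumed reconstruction on simulated data: the 60% training split is inferred from N = 959 out of 1599 rows, and the simulated coefficients are arbitrary. Only `training_data` and the model names `m1` and `m2` come from the output above; `test_data` and the other names are hypothetical.

```r
# Simulated stand-in for the wine data; coefficients are arbitrary.
set.seed(2024)
n <- 1599
wine <- data.frame(alcohol   = runif(n, min = 8.4, max = 14.9),
                   sulphates = runif(n, min = 0.33, max = 2.0))
wine$quality <- 1.5 + 0.35 * wine$alcohol +
  1.8 * log10(wine$sulphates) + rnorm(n, sd = 0.7)

# Assumed 60/40 split: floor(0.6 * 1599) gives the 959 training rows.
train_rows <- sample(n, size = floor(0.6 * n))
training_data <- wine[train_rows, ]
test_data <- wine[-train_rows, ]

# Each model adds one term to the previous one.
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- update(m1, . ~ . + log10(sulphates))

# R-squared on the training data, and prediction error on held-out rows.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
rmse <- sqrt(mean((test_data$quality - predict(m2, newdata = test_data))^2))
```

Comparing `r2` across nested models shows how much each added term contributes, while the held-out `rmse` indicates how well the model predicts wines it has not seen.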
The low R-squared values (at most about 0.3, calculated in the multivariate analysis) indicate that the linear model is not reliable for predicting quality of wines.
It clearly shows that the best quality wines are high in sulphates and alcohol content.
It is interesting to see that all these variables are not enough to produce a linear model that predicts quality of wine with sufficient accuracy, as seen from the linear model developed in the multivariate analysis.
This chart revealed how alcohol has a big influence on the quality of wines. Next time I’m at the supermarket, it’s the first thing I’m going to look for.
High alcohol content and high sulphate concentrations combined seem to produce better wines.
The error in the predictions means that there are missing variables that account for quality of wines; this data doesn’t seem very reliable for predicting quality of wines.
The wine data set contains information on the chemical properties of a selection of wines. It also includes sensory data (wine quality).
I started by looking at the individual distributions of the variables, trying to get a feel for each one. The single variable analysis helped me transform the data appropriately, so that the distributions were closer to normal for developing a linear model.
The bivariate analysis displayed some strong, noticeable relationships between individual variables and quality of wine. It was clear from the bivariate analysis that alcohol content is a strong predictor of wine quality. Volatile.acidity also strongly influences quality of wine. The best quality wines have higher alcohol content and lower volatile.acidity.
Since acidity and pH are related to each other, a regression model was derived to predict pH from acidity. Its boxplot showed the error between observed pH and expected pH. The boxplot shows more error in the poor quality wines, which suggests that the poor quality wines have some contamination that might be hurting their quality.
In the final part of the analysis, I used multivariate plots to investigate whether there were interesting combinations of variables that might affect quality. In the end, the model produced could not explain much of the variance in quality. The data is insufficient to produce a better fitting model that predicts wine quality with sufficient accuracy.
The difficulty in producing a better predictive model is insufficient data, as the quality of wine largely depends upon the process of making it. The tannins and yeast used in making the wine are important aspects. The aging of wine is also an important factor in predicting its quality. None of these critical aspects are available in this data, hence the poor predictive model.
For future studies, it would be interesting to measure more of the variables that affect wine quality and include them when modelling it.